Protein - Structure Mapping, Alignments, and Visualization

This notebook gives an example of how to map a single protein sequence to its structure, along with conducting sequence alignments and visualizing the mutations.

**Input:** Protein ID + amino acid sequence + mutated sequence(s)

**Output:** Representative protein structure, sequence alignments, and visualization of mutations

Imports



In [1]:

    
import sys
import logging



In [2]:

    
# Import the Protein class
from ssbio.core.protein import Protein



In [3]:

    
# Printing multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Logging

Set the logging level in logger.setLevel(logging.<LEVEL_HERE>) to specify how verbose you want the pipeline to be. Debug is most verbose.

CRITICAL
- Only really important messages shown
ERROR
- Major errors
WARNING
- Warnings that don't affect running of the pipeline
INFO (default)
- Info such as the number of structures mapped per gene
DEBUG
- Really detailed information that will print out a lot of stuff

**Warning:** `DEBUG` mode prints out a large amount of information, especially if you have a lot of genes. This may stall your notebook!



In [4]:

    
# Create logger
logger = logging.getLogger()
logger.setLevel(logging.INFO)  # SET YOUR LOGGING LEVEL HERE #



In [5]:

    
# Other logger stuff for Jupyter notebooks
handler = logging.StreamHandler(sys.stderr)
formatter = logging.Formatter('[%(asctime)s] [%(name)s] %(levelname)s: %(message)s', datefmt="%Y-%m-%d %H:%M")
handler.setFormatter(formatter)
logger.handlers = [handler]

Initialization of the project

Set these three things:

ROOT_DIR
- The directory where a folder named after your PROTEIN_ID will be created
PROTEIN_ID
- Your protein ID
PROTEIN_SEQ
- Your protein sequence

A directory will be created in ROOT_DIR with your PROTEIN_ID name. The folders are organized like so:

    ROOT_DIR
    └── PROTEIN_ID
        ├── sequences  # Protein sequence files, alignments, etc.
        └── structures  # Protein structure files, calculations, etc.



In [6]:

    
# SET FOLDERS AND DATA HERE
import tempfile
ROOT_DIR = tempfile.gettempdir()

PROTEIN_ID = 'SRR1753782_00918'
PROTEIN_SEQ = 'MSKQQIGVVGMAVMGRNLALNIESRGYTVSVFNRSREKTEEVIAENPGKKLVPYYTVKEFVESLETPRRILLMVKAGAGTDAAIDSLKPYLEKGDIIIDGGNTFFQDTIRRNRELSAEGFNFIGTGVSGGEEGALKGPSIMPGGQKDAYELVAPILTKIAAVAEDGEPCVTYIGADGAGHYVKMVHNGIEYGDMQLIAEAYSLLKGGLNLSNEELANTFTEWNNGELSSYLIDITKDIFTKKDEDGNYLVDVILDEAANKGTGKWTSQSALDLGEPLSLITESVFARYISSLKAQRVAASKVLSGPKAQPAGDKAEFIEKVRRALYLGKIVSYAQGFSQLRAASDEYHWDLNYGEIAKIFRAGCIIRAQFLQKITDAYAENADIANLLLAPYFKKIADEYQQALRDVVAYAVQNGIPVPTFSAAVAYYDSYRAAVLPANLIQAQRDYFGAHTYKRTDKEGIFHTEWLE'



In [7]:

    
# Create the Protein object
my_protein = Protein(ident=PROTEIN_ID, root_dir=ROOT_DIR, pdb_file_type='mmtf')



In [8]:

    
# Load the protein sequence
# This sets the loaded sequence as the representative one
my_protein.load_manual_sequence(seq=PROTEIN_SEQ, ident='WT', write_fasta_file=True, set_as_representative=True)









    Out[8]:





SeqProp(seq=Seq('MSKQQIGVVGMAVMGRNLALNIESRGYTVSVFNRSREKTEEVIAENPGKKLVPY...WLE', SingleLetterAlphabet()), id='WT', name='WT', description='WT <unknown description>', dbxrefs=[])

Mapping sequence --> structure

Since the sequence has been provided, we just need to BLAST it to the PDB.

**Note:** These methods do not download any 3D structure files.

Methods



In [9]:

    
# Mapping using BLAST
my_protein.blast_representative_sequence_to_pdb(seq_ident_cutoff=0.9, evalue=0.00001)
my_protein.df_pdb_blast.head()









    Out[9]:





['2zyd', '2zya', '3fwn', '2zyg']






    Out[9]:







  
    
      
      pdb_chain_id
      hit_score
      hit_evalue
      hit_percent_similar
      hit_percent_ident
      hit_num_ident
      hit_num_similar
    
    
      pdb_id
      
      
      
      
      
      
      
    
  
  
    
      2zya
      A
      2319.0
      0.0
      0.987179
      0.963675
      451
      462
    
    
      2zya
      B
      2319.0
      0.0
      0.987179
      0.963675
      451
      462
    
    
      2zyd
      A
      2319.0
      0.0
      0.987179
      0.963675
      451
      462
    
    
      2zyd
      B
      2319.0
      0.0
      0.987179
      0.963675
      451
      462
    
    
      2zyg
      A
      2284.0
      0.0
      0.982906
      0.950855
      445
      460

Downloading and ranking structures

Methods

**Warning:** Downloading all PDBs takes a while, since they are also parsed for metadata. You can skip this step and just set representative structures below if you want to minimize the number of PDBs downloaded.



In [10]:

    
# Download all mapped PDBs and gather the metadata
my_protein.download_all_pdbs()
my_protein.df_pdb_metadata.head(2)









    Out[10]:





['2zyd', '2zya', '3fwn', '2zyg']






    



[2019-08-10 23:39] [ssbio.core.protein] WARNING: Not showing info for mapped PDB entries, as they have not been downloaded yet. Run download_all_pdbs






    Out[10]:







  
    
      
      mapped_chains
      structure_file
    
    
      pdb_id
      
      
    
  
  
    
      2zya
      A;B
      2zya.mmtf
    
    
      2zyd
      A;B
      2zyd.mmtf



In [11]:

    
# Set representative structures
my_protein.set_representative_structure()









    Out[11]:





<StructProp REP-2zyd at 0x122cbd630>

Loading and aligning new sequences

You can load additional sequences into this protein object and align them to the representative sequence.



In [12]:

    
my_protein.__dict__









    Out[12]:





{'id': 'SRR1753782_00918',
 'description': None,
 'notes': {},
 'pdb_file_type': 'mmtf',
 '_root_dir': '/var/folders/0c/_6vlxjxj4tj78vkyl_3z9p0h0000gn/T',
 'sequences': [SeqProp(seq=Seq('MSKQQIGVVGMAVMGRNLALNIESRGYTVSVFNRSREKTEEVIAENPGKKLVPY...WLE', SingleLetterAlphabet()), id='WT', name='WT', description='WT <unknown description>', dbxrefs=[])],
 'representative_sequence': SeqProp(seq=Seq('MSKQQIGVVGMAVMGRNLALNIESRGYTVSVFNRSREKTEEVIAENPGKKLVPY...WLE', SingleLetterAlphabet()), id='WT', name='WT', description='WT <unknown description>', dbxrefs=[]),
 'structures': [<PDBProp 2zyd at 0x110bbb668>,
  <PDBProp 2zya at 0x110bbb438>,
  <PDBProp 3fwn at 0x122cd3c18>,
  <PDBProp 2zyg at 0x122cd3358>,
  <StructProp REP-2zyd at 0x122cbd630>],
 'representative_structure': <StructProp REP-2zyd at 0x122cbd630>,
 'representative_chain': 'A',
 'representative_chain_seq_coverage': 0.9572649572649573,
 'sequence_alignments': [<<class 'Bio.Align.MultipleSeqAlignment'> instance (2 records of length 468, SingleLetterAlphabet()) at 122dc2240>,
  <<class 'Bio.Align.MultipleSeqAlignment'> instance (2 records of length 468, SingleLetterAlphabet()) at 12370fe10>],
 'structure_alignments': []}

Methods



In [13]:

    
# Input your mutated sequence and load it
mutated_protein1_id = 'N17P_SNP'
mutated_protein1_seq = 'MSKQQIGVVGMAVMGRPLALNIESRGYTVSVFNRSREKTEEVIAENPGKKLVPYYTVKEFVESLETPRRILLMVKAGAGTDAAIDSLKPYLEKGDIIIDGGNTFFQDTIRRNRELSAEGFNFIGTGVSGGEEGALKGPSIMPGGQKDAYELVAPILTKIAAVAEDGEPCVTYIGADGAGHYVKMVHNGIEYGDMQLIAEAYSLLKGGLNLSNEELANTFTEWNNGELSSYLIDITKDIFTKKDEDGNYLVDVILDEAANKGTGKWTSQSALDLGEPLSLITESVFARYISSLKAQRVAASKVLSGPKAQPAGDKAEFIEKVRRALYLGKIVSYAQGFSQLRAASDEYHWDLNYGEIAKIFRAGCIIRAQFLQKITDAYAENADIANLLLAPYFKKIADEYQQALRDVVAYAVQNGIPVPTFSAAVAYYDSYRAAVLPANLIQAQRDYFGAHTYKRTDKEGIFHTEWLE'

my_protein.load_manual_sequence(ident=mutated_protein1_id, seq=mutated_protein1_seq)









    Out[13]:





SeqProp(seq=Seq('MSKQQIGVVGMAVMGRPLALNIESRGYTVSVFNRSREKTEEVIAENPGKKLVPY...WLE', ExtendedIUPACProtein()), id='N17P_SNP', name='<unknown name>', description='<unknown description>', dbxrefs=[])



In [14]:

    
# Input another mutated sequence and load it
mutated_protein2_id = 'Q4S_N17P_SNP'
mutated_protein2_seq = 'MSKSQIGVVGMAVMGRPLALNIESRGYTVSVFNRSREKTEEVIAENPGKKLVPYYTVKEFVESLETPRRILLMVKAGAGTDAAIDSLKPYLEKGDIIIDGGNTFFQDTIRRNRELSAEGFNFIGTGVSGGEEGALKGPSIMPGGQKDAYELVAPILTKIAAVAEDGEPCVTYIGADGAGHYVKMVHNGIEYGDMQLIAEAYSLLKGGLNLSNEELANTFTEWNNGELSSYLIDITKDIFTKKDEDGNYLVDVILDEAANKGTGKWTSQSALDLGEPLSLITESVFARYISSLKAQRVAASKVLSGPKAQPAGDKAEFIEKVRRALYLGKIVSYAQGFSQLRAASDEYHWDLNYGEIAKIFRAGCIIRAQFLQKITDAYAENADIANLLLAPYFKKIADEYQQALRDVVAYAVQNGIPVPTFSAAVAYYDSYRAAVLPANLIQAQRDYFGAHTYKRTDKEGIFHTEWLE'

my_protein.load_manual_sequence(ident=mutated_protein2_id, seq=mutated_protein2_seq)









    Out[14]:





SeqProp(seq=Seq('MSKSQIGVVGMAVMGRPLALNIESRGYTVSVFNRSREKTEEVIAENPGKKLVPY...WLE', ExtendedIUPACProtein()), id='Q4S_N17P_SNP', name='<unknown name>', description='<unknown description>', dbxrefs=[])



In [15]:

    
# Conduct pairwise sequence alignments
my_protein.pairwise_align_sequences_to_representative()



In [20]:

    
# View IDs of all sequence alignments
[x.id for x in my_protein.sequence_alignments]

# View the stored information for one of the alignments
my_alignment = my_protein.sequence_alignments.get_by_id('WT_Q4S_N17P_SNP')
my_alignment.annotations
str(my_alignment[0].seq)
str(my_alignment[1].seq)









    Out[20]:





['WT_2zyd-A', 'WT_2zyd-B', 'WT_N17P_SNP', 'WT_Q4S_N17P_SNP']






    Out[20]:





{'identity': 466,
 'similarity': 466,
 'gaps': 0,
 'score': 2376.0,
 'percent_identity': 99.6,
 'percent_similarity': 99.6,
 'percent_gaps': 0.0,
 'a_seq': 'WT',
 'b_seq': 'Q4S_N17P_SNP',
 'ssbio_type': 'seqalign',
 'mutations': [('Q', 4, 'S'), ('N', 17, 'P')],
 'deletions': [],
 'insertions': []}






    Out[20]:





'MSKQQIGVVGMAVMGRNLALNIESRGYTVSVFNRSREKTEEVIAENPGKKLVPYYTVKEFVESLETPRRILLMVKAGAGTDAAIDSLKPYLEKGDIIIDGGNTFFQDTIRRNRELSAEGFNFIGTGVSGGEEGALKGPSIMPGGQKDAYELVAPILTKIAAVAEDGEPCVTYIGADGAGHYVKMVHNGIEYGDMQLIAEAYSLLKGGLNLSNEELANTFTEWNNGELSSYLIDITKDIFTKKDEDGNYLVDVILDEAANKGTGKWTSQSALDLGEPLSLITESVFARYISSLKAQRVAASKVLSGPKAQPAGDKAEFIEKVRRALYLGKIVSYAQGFSQLRAASDEYHWDLNYGEIAKIFRAGCIIRAQFLQKITDAYAENADIANLLLAPYFKKIADEYQQALRDVVAYAVQNGIPVPTFSAAVAYYDSYRAAVLPANLIQAQRDYFGAHTYKRTDKEGIFHTEWLE'






    Out[20]:





'MSKSQIGVVGMAVMGRPLALNIESRGYTVSVFNRSREKTEEVIAENPGKKLVPYYTVKEFVESLETPRRILLMVKAGAGTDAAIDSLKPYLEKGDIIIDGGNTFFQDTIRRNRELSAEGFNFIGTGVSGGEEGALKGPSIMPGGQKDAYELVAPILTKIAAVAEDGEPCVTYIGADGAGHYVKMVHNGIEYGDMQLIAEAYSLLKGGLNLSNEELANTFTEWNNGELSSYLIDITKDIFTKKDEDGNYLVDVILDEAANKGTGKWTSQSALDLGEPLSLITESVFARYISSLKAQRVAASKVLSGPKAQPAGDKAEFIEKVRRALYLGKIVSYAQGFSQLRAASDEYHWDLNYGEIAKIFRAGCIIRAQFLQKITDAYAENADIANLLLAPYFKKIADEYQQALRDVVAYAVQNGIPVPTFSAAVAYYDSYRAAVLPANLIQAQRDYFGAHTYKRTDKEGIFHTEWLE'



In [21]:

    
# Summarize all the mutations in all sequence alignments
s,f = my_protein.sequence_mutation_summary(alignment_type='seqalign')
print('Single mutations:')
s
print('---------------------')
print('Mutation fingerprints')
f









    



Single mutations:






    Out[21]:





{('N', 17, 'P'): ['N17P_SNP', 'Q4S_N17P_SNP'], ('Q', 4, 'S'): ['Q4S_N17P_SNP']}






    



---------------------
Mutation fingerprints






    Out[21]:





{(('N', 17, 'P'),): ['N17P_SNP'],
 (('Q', 4, 'S'), ('N', 17, 'P')): ['Q4S_N17P_SNP']}

Some additional methods

Getting binding site/other information from UniProt



In [22]:

    
import ssbio.databases.uniprot



In [23]:

    
this_examples_uniprot = 'P14062'
sites = ssbio.databases.uniprot.uniprot_sites(this_examples_uniprot)
my_protein.representative_sequence.features = sites
my_protein.representative_sequence.features









    Out[23]:





[SeqFeature(FeatureLocation(ExactPosition(9), ExactPosition(15)), type='Nucleotide binding'),
 SeqFeature(FeatureLocation(ExactPosition(32), ExactPosition(35)), type='Nucleotide binding'),
 SeqFeature(FeatureLocation(ExactPosition(73), ExactPosition(76)), type='Nucleotide binding'),
 SeqFeature(FeatureLocation(ExactPosition(127), ExactPosition(130)), type='Region'),
 SeqFeature(FeatureLocation(ExactPosition(185), ExactPosition(187)), type='Region'),
 SeqFeature(FeatureLocation(ExactPosition(182), ExactPosition(183)), type='Active site'),
 SeqFeature(FeatureLocation(ExactPosition(189), ExactPosition(190)), type='Active site'),
 SeqFeature(FeatureLocation(ExactPosition(101), ExactPosition(102)), type='Binding site'),
 SeqFeature(FeatureLocation(ExactPosition(101), ExactPosition(102)), type='Binding site'),
 SeqFeature(FeatureLocation(ExactPosition(190), ExactPosition(191)), type='Binding site'),
 SeqFeature(FeatureLocation(ExactPosition(259), ExactPosition(260)), type='Binding site'),
 SeqFeature(FeatureLocation(ExactPosition(286), ExactPosition(287)), type='Binding site'),
 SeqFeature(FeatureLocation(ExactPosition(444), ExactPosition(445)), type='Binding site'),
 SeqFeature(FeatureLocation(ExactPosition(450), ExactPosition(451)), type='Binding site'),
 SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(468)), type='Chain', id='PRO_0000090047')]

Mapping sequence residue numbers to structure residue numbers

Methods



In [24]:

    
# Returns a dictionary mapping sequence residue numbers to structure residue identifiers
# Will warn you if residues are not present in the structure
structure_sites = my_protein.map_seqprop_resnums_to_structprop_resnums(resnums=[1,3,45], 
                                                                       use_representatives=True)
structure_sites









    



[2019-08-10 23:39] [ssbio.core.protein] WARNING: REP-2zyd-A, 1: structure file does not contain coordinates for this residue






    Out[24]:





{3: 3, 45: 45}

Viewing structures

The awesome package nglview is utilized as a backend for viewing structures within a Jupyter notebook. ssbio view functions will either return a NGLWidget object, which is the same as using nglview like the below example, or act upon the widget object itself.

# This is how NGLview usually works - it will load a structure file and return a NGLWidget "view" object.
import nglview
view = nglview.show_structure_file(my_protein.representative_structure.structure_path)
view

Methods



In [27]:

    
# View just the structure
view = my_protein.representative_structure.view_structure(recolor=True)
view



In [28]:

    
# Map the mutations on the visualization (scale increased) - will show up on the above view
my_protein.add_mutations_to_nglview(view=view, alignment_type='seqalign', scale_range=(4,7), 
                                    use_representatives=True)









    



[2019-08-10 23:40] [ssbio.protein.structure.structprop] INFO: Selection: ( :A ) and not hydrogen and 17
[2019-08-10 23:40] [ssbio.protein.structure.structprop] INFO: Selection: ( :A ) and not hydrogen and 4



In [29]:

    
# Add sites as shown above in the table to the view
my_protein.add_features_to_nglview(view=view, use_representatives=True)









    



[2019-08-10 23:40] [ssbio.core.protein] INFO: Active site at sequence residue 183, structure residue 183
[2019-08-10 23:40] [ssbio.core.protein] INFO: Active site at sequence residue 190, structure residue 190
[2019-08-10 23:40] [ssbio.core.protein] INFO: Binding site at sequence residue 102, structure residue 102
[2019-08-10 23:40] [ssbio.core.protein] INFO: Binding site at sequence residue 102, structure residue 102
[2019-08-10 23:40] [ssbio.core.protein] INFO: Binding site at sequence residue 191, structure residue 191
[2019-08-10 23:40] [ssbio.core.protein] INFO: Binding site at sequence residue 260, structure residue 260
[2019-08-10 23:40] [ssbio.core.protein] INFO: Binding site at sequence residue 287, structure residue 287
[2019-08-10 23:40] [ssbio.core.protein] INFO: Binding site at sequence residue 445, structure residue 445
[2019-08-10 23:40] [ssbio.core.protein] INFO: Binding site at sequence residue 451, structure residue 451

Saving



In [30]:

    
import os.path as op
my_protein.save_json(op.join(my_protein.protein_dir, '{}.json'.format(my_protein.id)))









    



[2019-08-10 23:40] [root] WARNING: json-tricks: numpy scalar serialization is experimental and may work differently in future versions
[2019-08-10 23:40] [ssbio.io] INFO: Saved <class 'ssbio.core.protein.Protein'> (id: SRR1753782_00918) to /var/folders/0c/_6vlxjxj4tj78vkyl_3z9p0h0000gn/T/SRR1753782_00918/SRR1753782_00918.json

	pdb_chain_id	hit_score	hit_evalue	hit_percent_similar	hit_percent_ident	hit_num_ident	hit_num_similar
pdb_id
2zya	A	2319.0	0.0	0.987179	0.963675	451	462
2zya	B	2319.0	0.0	0.987179	0.963675	451	462
2zyd	A	2319.0	0.0	0.987179	0.963675	451	462
2zyd	B	2319.0	0.0	0.987179	0.963675	451	462
2zyg	A	2284.0	0.0	0.982906	0.950855	445	460